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Position of the problem 

Design, development and engineering of industrial power burners have strong 
mathematical requests: 

• numerical resolution of a PDEs system involving Navier-Stokes equations 
for velocity and pressure fields, energy conservation law for temperature 
field, Fick's law for diffusion of all the chemical species in the combustion 
chamber; 

• geometrical design of the combustion head for a correct shape and optimal 
efficiency of flame; 

• geometrical design of ventilation fans and computation of a correct air 
inflow for optimal combustion. 




Figure 1: Combustion head and chamber for burner. 



Computational complexity analysis 
for a flow (i) 

Simple example for a detailed knowledge of the velocity field of fluid particles 
in the combustion chamber: 

• M is the number of flow streamlines to compute; 

• S is the number of geometrical points for every streamline. 

High values for M arc important for a realistic simulation of the flow, high 
values for S are important for a fine graphic resolution: minimal values are of 
order O(10 3 - 10 4 ). 

Suppose to use a 3D grid 10 x 10 x 1000 cm (hence M = 100, S = 1000), a 
medium value v; t = 50 cm/sec for every cartesian component of velocity vector 
field, and a space resolution h = 0.5 cm. 
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Computational complexity analysis 
for a flow (2) 

For numeric resolution of time-dependent advective PDEs, the Courant-Friedrichs- 
Lewy (CFL) condition gives an upper limit for the time step: 

At< ^ 

— V 

where c is a costant, usually < 1, depending on the used numeric method, and 
v = sup \vj\. The quantity is called CFL number. Let c = 1 ; then 

At < Mm£ = 0.01 sec. 

sec 

As consequence, for 1 real minute of simulation the flops are of order O(10 10 ) 
and the occupation of RAM is 0(10°) GB: 

the computation is CPU expensive, RAM consuming and produces a lot of un- 
useful data (100 snapshots of the flow every second). 

A Finite Differences method and Interpolations 

In the effort of minimize the relevance of these problems, we have studied a 
numeric model based on 

• a Finite Differences schema with a not too restrictive CFL condition; 

• an appropriate interpolation of the numeric FD velocity-field for a finer 
resolution without modifying the grid step. 

This model gives a numeric solution comparable with the solutions based on finer 
grids: we present an estimate of its goodness and a mathematical justification. 
The FD method is based on Lax-Friedrichs schema: 

• discretization in time: 9 t u™ = -%i(u™ +1 — u"), 
where u™ <— + + 

(for a better approximation we compute the mean on three values, two in 
LF original form); 

• discretization in space: = ^(u™ +1 — «™_i); 

where u is a velocity component, n the time step, j a value on the cartesian 
coordinate x. 
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Computational aspects of Lax-Friedrichs schema 



For this schema the CFL condition has costant c = 1; the Finite Elements 
method with the same schema for discretization in time has a more restrictive 
costant c < 1. 

If K G R + , K < i, we can define the norm = K supj\uj |; then the modified 
LF schema is strongly stable: \\u n \\ < ||u™ _J || Vn S N; hence there is not the 
blowing up of the numeric solution. 

Suppose we want to compute at most 10 snapshots for every second; then, in 
the hypothesis v — 50 cm/sec as the previous example, from 

vAt < h 

we must use as minimum a grid step h = 5 cm. 

This case gives S = 200, the total flops for 1 minute of simulation is now of 
order O(10 8 ) and the occupation of RAM is of order 0(1O -2 ) GB. 
The gain is of order O(10 2 ). 

The grid step h = 5 cm is too big for a good resolution of streamlines for flows 
into the combustion head: for better final results, it can be useful a method 
based on interpolations of the computed LF values. 



Interpolation of trajectories (1) 

Every streamline of LF solution is divided into N couples of points, 
{(Pi, P 2 ), (P 2 , P 3 ), (Pn-i, Pn)}, so that S = N+l. 

We use for every couple a cubic polynomial (spline) imposing the following four 
analytical conditions (v is the LF solution): 

• passage at Pk point, 1 < k < N — 1; 

• passage at Pk+i point; 

• the first derivative at Pk is equal to v^; 

• the first derivative at Pk+i is equal to Vk+i ■ 

In this way we can construct a set of class C 1 new trajectories; we want to 
estimate 

1. the overload for finding and valuating all the cubics; 

2. the difference compared to the real LF solution of the smaller grid step. 
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Interpolation of trajectories (2) 



For simplicity, consider a single component of a cubic: 
s(t) = at 3 + bt 2 + ct + d, where < t < 1; 
if T is the 4x4 matrix 

/ 2 -2 1 1 \ 

-3 3 -2 -1 

10 

\ 1 / 

and (pi,P2,v\,V2) is the vector of cartesian coordinates and velocities compo- 
nents of points Pi and P 2 , we have 



(a,b,c,d) = T(pi,p 2 ,vi,v 2 ). 



Interpolation of trajectories (3) 



We define the 4M x 4M global matrix 



/TO 
T 



G 



\ 




\ . . T J 

where is the 4x4 zero-matrix. Then 

• G is a sparse matrix with density number < 

• if p = (P(i,i),P(i.2)j w (m,i)7 U (M,2)); we can compute the cubics, between 
two points, for all the M trajectories by the product Gp. 
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Interpolation of trajectories (4) 



The theoric number of flops for computing the coefficients of all the splines is of 
order O(10M 2 N). If M = 10 4 and N = 10 3 , the total number of flops is O(10 12 ). 

With a single processor having a clock frequency of 0(1) GHz, the total time 
can require some hundreds of seconds, a performance not very good for practical 
purposes; using 

• some mathematical libraries as LAPACK routines with Fortran calls or 
Mat lab environment, 

• distributed computation on a multinode cluster, 

we have reached a computation time of some tens of seconds. 

Example: Matlab has internal Lapack level 3 BLAS routines for fast matrix- 
matrix multiplication and treatment of sparse matrices. 



Interpolation of trajectories (5) 



Matrix-vector multiplication 



I Lapack sparse i i..- i plication ■.villi Matlab | 




Performances for a single Gp multiplication using an Intel Xeon 3.2 GHz with 
1 MB internal cache: for M=10 4 the memory occupied by the sparse version of 
G is only O(10 2 ) KB instead of theoric O(10 6 ): G can be stored in processor 
cache. 
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Computation of splines values (1) 



Now we need a fast method for computing the splines values in a set of param- 
eter ticks with fine sampling. 

Let r 6 N + the number of ticks for each cubic: then the values of the parameter t 
in these ticks are (0, 2 , 1); the value of a cubic at t is a scalar product: 

atl +bt% + ct + d= (a,b,c,d) ■ (tl,t%,t , 1). 

Consider the constant 4 x (r + 1) R matrix and the (M x 4) C matrix: 



R 



C 



/0 (I) 3 

o (i) 1 
V 1 1 



Cl 
C2 



1 

d 2 



1 
1 

1 y 



\ % &m cm c?m / 



Computation of splines values (2) 

Then the M x (r + 1) matrix CR contains for each row the values of a cubic 
between two points, for all the trajectories (eulerian method: computation of 
all the position and velocity at a fixed instant). The flops for one multiplication 
are of order O(10Mr). 

Tests with Xcon 3.2 GHz processor, M = 10 4 , r = 10 and GNU Fortran77 
show a time of 0.01 seconds for a multiplication. 

With N = 10 2 , the time for computing the values of all the splines of a single 
time step is 4-5 seconds (theoric for 3D: 0.01 x 10 2 x3 = 5 sees). 

If p is the number of available processors and mod(M, p) = 0, the computation 
can be parallelized distributing ^ rows of matrix C to each processor: there is 
no need of communication among processes. 

A version of High Performance Fortran on a SMP system with 4 Itaniumll 
processors shows a quasilinear speedup for M, N of order O(10 3 ). 
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Time for computation 



HPF on four Itanium II 

35 | 1 




1 50 100 

N 



These are the total time of computation for the two methods in the case of a 
cilinder of length L = lm, a flow with a max. speed v = lOcm/scc, M = 10 4 , r 
= 10 and 1 minute of real simulation. The space grid is h = j^. 

Estimate of LF+interpolations vs normal LF (i) 

But what is the difference between the modified LF solution and normal LF so- 
lution? 

Consider the one-dimensional case. Let u—(uk) the solution of normal LF 
schema with grid step h and initial value Uo; w=(w m ) the solution of normal 
LF schema with grid step s x h, s e N + , and initial value Wo C Uo; v=(i> n ) the 
solution of modified LF schema obtained by interpolation of w and valuation 
on s points per cubic; for a cubic, let Vk, k < s, the value of v at t=^ and 
the value of u at the corresponding node of the finer grid; 2^ the CFL number 
and N the iV-th time step. Let 

M = max \u , m - Uo,n\ 

| m — n | = 1 

Then it is possible to prove this result: 

Theorem 1 If Mg > 0, there are two positive constants A and B such that 

N v ^ 

\v n ~u n \ < (A + Bs)M ^2(——y Vn € {grid indexes}, VN e N. 
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Estimate of LF+interpolations vs normal LF (2) 

The CFL number is usually indicated by Xv. From the previous theorem it 
follows: 

Corollary 1 If Xv < 2, then 

. 2{A + Bs)M 
\v„ - u n \ < . 

2 — Xv 

The CFL condition satisfies the hypothesis of the corollary. 

Hence, for a realistic solution from the LF+interpolations model, the conditions 

are: 

• a small (<C 2) CFL number, 

• a not too big number s of valuations for the cubics; note that s has the 
inverse logical meaning of the previous r parameter. 

Note that if M is very big, as in the case of very caotic flows, the LF+interpolations 
solution can be not very realistic. 

Estimate of LF+interpolations vs normal LF (3) 

Testing the estimate: example for one-dimensional non linear Navier-Stokes 
equation, Xv = 1, s = 10, after N — 10 5 time steps; graphic of the error between 
LF+interpolations and normal LF solutions. 



Error on velocity-field 

600 I , , , , , , 




1 23456789 10 

M (mm/sec) 



In this case it can be shown that A — 8, i?=2isa first, not optimized, 
approximation for the two constants. The picture shows that the estimate is 
correct but large. 
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Conclusions 

The numeric LF schema can be modified using the interpolations method so 
that: 

• the time spent on computation is much lower than the time of the LF 
based on the corresponding finer grid; 

• the computation can be parallelized on multiprocessors environment with 
very reduced need of communication; 

• the error on normal LF solution can be estimated and depends on the 
initial value u of the problem; 

• the estimate is compatible with CFL condition. 



Thanks 
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